The APRAM - The Rounds Complexity Measure and
the Explicit Costs of Synchronization*

Richard  Cole  Ofer  Zajicek 

New  York  University 

Abstract 

This paper studies the explicit costs of synchronization by examining an asynchronous
generalization of the PRAM model called the APRAM model. The APRAM model
and its associated complexity measure, the rounds complexity, are defined and then
illustrated by designing and analyzing two algorithms: a parallel summation algorithm
which proceeds along an implicit complete binary tree and a recursive doubling algorithm
which proceeds along a linked list. In both cases replacing global synchronization with
local synchronization yields algorithms with reduced complexity.

1      Introduction 

In  this  paper  we  consider  the  effect  of  process  asynchrony  on  parallel  algorithm  design.  As 
is  well  known,  the  main  effort  in  parallel  algorithm  design  has  employed  the  PRAM  model. 
This  model  hides  many  of  the  implementation  issues,  allowing  the  algorithm  designer  to  focus 
first  and  foremost  on  the  structure  of  the  computational  problem  at  hand  -  synchronization 
is  one  of  these  hidden  issues. 

This  paper  is  part  of  a  broader  research  effort  which  has  sought  to  take  into  account  some 
of  the  implementation  issues  hidden  by  the  PRAM  model.  Broadly  speaking,  two  major 
approaches  have  been  followed.  One  body  of  research  is  concerned  with  asynchrony  and  the 
resulting non-uniform environment in which processes operate¹ [CZ89,CZ90c,Nis90,MPS89],
[MSP90].  The  other  body  of  research  has  considered  the  effect  of  issues  such  as  latency  to 
memory,  but  assumes  a  uniform  environment  for  the  processes  [PU87,PY88,AC88,ACS89], 
[Gib89]. 


*The work was supported in part by NSF grants CCR-8902221 and CCR-8906949, and by a John Simon
Guggenheim Memorial Foundation Fellowship.

¹We distinguish between processes and processors in order to emphasize that the APRAM is not a machine
model but rather a programming model; it is the task of a compiler to implement the programming model
on actual machines. The term processor will be used to refer to this component of a machine.


The  PRAM  is  a  synchronous  model  and  thus  it  strips  away  problems  of  synchronization. 
However,  the  implicit  synchronization  provided  by  the  model  hides  the  synchronization  costs 
from  the  user.  In  many  cases,  an  algorithm  may  have  to  be  redesigned  in  order  to  allow  it 
to  run  efficiently  in  an  asynchronous  environment.  In  this  paper,  we  are  concerned  with  the 
design of algorithms which perform well in the presence of the non-uniformity introduced
by  asynchrony.  The  work  of  Nishimura  [Nis90]  is  similarly  motivated.  Another  approach 
to  the  problem  of  asynchrony  is  to  seek  to  efficiently  compile  PRAM  algorithms  to  operate 
in  asynchronous  environments;  this  approach  is  followed  by  Martel,  Park  and  Subramonian 
[MPS89,MSP90].  They  give  efficient  randomized  simulations  of  arbitrary  PRAM  algorithms 
on  an  asynchronous  model,  given  certain  architectural  assumptions  (e.g.  the  availability  of 
a  compare&swap  instruction).  It  is  not  clear  whether  similar  deterministic  compilers  exist; 
in  the  absence  of  such  compilers,  to  obtain  deterministic  algorithms  it  appears  necessary  to 
design  them  in  an  asynchronous  environment;  this  is  the  focus  of  our  work. 

Our algorithm design is targeted at the APRAM model, defined more precisely in Section 3.
It assumes that each process can access each memory location in unit time. A non-uniform
environment arises because processes may proceed at considerably different speeds;
in  addition,  the  same  process  may  proceed  at  different  speeds  at  different  times  during  the 
execution  of  the  algorithm.  We  study  the  effects  of  the  nonuniformity  on  the  complexity  of 
algorithms,  aiming  to  design  algorithms  which  perform  well  in  such  environments. 

Even  if  the  underlying  hardware  is  uniform  and  synchronous,  the  environment  may 
appear  nonuniform  at  the  process  level  for  several  reasons.  For  example,  it  may  appear 
nonuniform  due  to  multitasking  and  consequent  nonuniform  task  or  process  distribution,  or 
because  of  interactions  with  external  devices  (e.g.  interrupts  to  service  disk  I/O). 

We  hope  that  investigating  the  synchronization  costs  will  further  our  understanding  of 
some  of  the  issues  involved  in  parallel  computation.  The  better  we  understand  the  issues 
involved  the  more  likely  we  are  to  develop  better  abstract  computational  models,  which  will 
ultimately  guide  us  in  taking  advantage  of  the  power  of  parallelism. 

Now, we briefly discuss some of the work concerned with uniform environments.
[PU87,PY88,AC88] show that the communication among the processes can be reduced by
designing algorithms which take advantage of temporal locality of access; they assumed that each
(global) memory access has a fixed cost associated with it. In a related paper [ACS89],
Aggarwal et al. argue that typically it takes a substantial time to get the first word from global
memory, but after that, subsequent words can be obtained quite rapidly. They introduce the
BPRAM model which allows block transfers from shared memory to local memory. They
show that the complexity of algorithms can be reduced significantly by taking advantage of
spatial locality of reference. This approach implicitly assumes that the (virtual) machine
is uniform, comprising a collection of similar processes. In addition to latency to memory,
Gibbons [Gib89] studied the cost of process synchronization by considering an asynchronous
model. Although he assumed that the machine is asynchronous, his analysis in effect
assumes that the processes are roughly similar in speed. More precisely, he required that reads
and writes be separated by global synchronization; this results in a uniform environment
from the process perspective.

2      Overview 

The RAM is a very popular model in sequential computation and algorithm design. Therefore,
it is not surprising that a generalization of the RAM, the PRAM, became a popular
model in the field of parallel computation. We use another generalization of the RAM
model, called the APRAM, suggested by Kruskal et al. [KRS88a]; a formal definition of
the APRAM model is given in Section 3. It includes the standard PRAM as a submodel.
Roughly speaking, the APRAM can be viewed as a collection of processes which share memory.
Each process executes a sequence of basic atomic operations called events. An event
can either perform a computation locally (by changing the state of the process) or access
the shared memory; accessing the memory has a unit cost associated with it. An APRAM
computation is a serialization of the events executed by all the processes.

Synchronization costs fall into two categories: explicit costs and implicit costs. By explicit
costs we mean the overhead for achieving synchronization. This could be, for instance,
the cost of executing extra code that must be added to the algorithm in order to synchronize.
When processes proceed at different speeds, if the algorithm is required to proceed in
lock step, the time required to execute a step is dictated by the slowest process. By implicit
costs  we  refer  to  the  cost  associated  with  lock  step  execution  apart  from  the  explicit  costs  of 
synchronization.  We  do  not  define  the  notion  of  explicit  and  implicit  costs  formally,  for  we 
do  not  think  we  have  enough  experience  to  justify  definitive  definitions.  Rather  we  present 
them  as  concepts  which  are  exemplified  in  our  analysis. 

The parallel runtime complexity associated with the PRAM model describes the complexity
of PRAM algorithms as a function of the global clock provided by that model. The
APRAM model does not have such a clock. Consequently, it is necessary to define complexity
measures to replace the running time complexity. We replace the role of the global
clock by a virtual clock and we eventually investigate three complexity measures: the rounds
complexity, the unbounded delays complexity and the bounded delays complexity; however,
in this paper, we only discuss the rounds complexity measure.

In  Section  4  we  define  the  rounds  complexity;  it  replaces  the  global  clock  used  by  the 
PRAM  model  by  a  virtual  clock.  This  approach  was  introduced  in  [PF77]  and  used  in 
[AFL83,LF81,KRS88b]  and  is  common  in  the  area  of  distributed  computing  (see  [Awe87], 
[Awe85,AG87]).  Consider  a  computation,  C.  A  virtual  clock  of  C  is  an  assignment  of  unique 
virtual  times  to  the  events  of  C;  the  times  assigned  are  a  non-decreasing  function  of  the 
event  number. 

The virtual clock is meant to correspond to the "real" time at which the operations occurred
in one possible execution of the algorithm, called a computation. The time difference
between  two  consecutive  events  of  a  process  is  called  the  duration  of  the  later  event.  The 
length  of  a  computation  is  the  time  assigned  to  the  last  event  in  the  computation. 

The  rounds  complexity  of  an  algorithm  is  the  length  of  a  computation  maximized  over 
all  possible  computations;  i.e.,  maximized  over  all  possible  interleavings  of  the  events  of  the 
processes.  We  assume  that  the  duration  of  events  is  at  most  one.  In  effect,  in  the  rounds 
complexity  measure  the  slowest  process  defines  a  unit  of  time  (a  round);  the  complexity 
is  expressed  in  rounds.  Although  this  is  inadequate  for  measuring  the  implicit  costs  of 
synchronization,  using  the  rounds  complexity  we  are  able  to  measure  the  explicit  costs  of 
synchronization. 

We  analyze  two  algorithms  under  the  rounds  complexity  measure:  a  tree  based  parallel 
summation  algorithm  and  a  list  based  recursive  doubling  algorithm.  Both  are  fundamental 
algorithms  and  appear  as  subroutines  in  many  parallel  algorithms.  We  show  that  both 
algorithms have complexity $O(\log n)$ rounds, comparable to their PRAM parallel run time
complexity  (assuming  a  linear  number  of  processes).  Recall  that  one  round  of  the  virtual 
clock  roughly  corresponds  to  one  parallel  time  step  of  the  PRAM  model. 

In a companion paper we describe and analyze a more substantial algorithm: an algorithm
for finding the connected components of an undirected graph [CZ90b]. This algorithm
differs  substantially  from  all  known  PRAM  algorithms.  We  avoid  the  need  to  synchronize 
the  processes,  thereby  obtaining  an  algorithm  whose  behavior  appears  somewhat  chaotic. 
The  description  of  the  algorithm  is  relatively  simple  and  straightforward;  however,  due  to 
its  apparently  chaotic  nature  and  the  unpredictability  of  the  asynchronous  environment, 
its  analysis  is  quite  challenging.  We  show  that  the  rounds  complexity  of  the  algorithm  is 
$O(\log n)$ rounds assuming a linear number of processes.

In  a  second  companion  paper  we  study  the  implicit  costs  of  synchronization  by  allowing 
the  duration  of  events  to  be  larger  than  1  [CZ90a].  We  achieve  this  by  generalizing  the 
definition of a virtual clock due to [PF77,AFL83,LF81]. We also replace the worst case
complexity of Section 4 with a probabilistic analysis: the duration of an event is assumed
to  be  a  random  variable  with  a  known  probability  distribution. 

3     The  APRAM 

The  random  access  machine,  or  the  RAM,  is  the  standard  model  for  sequential  computation. 
It  is  therefore  natural  to  model  parallel  computation  by  extensions  of  the  RAM.  One  such 
extension, the PRAM, has become a widely used model for parallel computation. In this
paper we consider another generalization of the RAM model. The generalization incorporates
both the standard RAM model and the PRAM model, as well as the APRAM model,
the  subject  of  this  research.  Henceforth  the  standard  RAM  model  is  called  the  sequential 
RAM;  the  term  RAM  is  reserved  for  the  new  generalized  model. 

3.1     The  (Generalized)  RAM 

A  computation  model  is  an  abstraction  whose  purpose  is  to  model  our  notion  of  a  computer. 
Intuitively,  a  computer  is  a  machine  with  memory  and  registers,  which  when  presented  with 
a  program  performs  actions.  The  actions  may  fetch  a  value  from  memory,  modify  a  memory 
location  or  perform  a  calculation  on  the  machine's  registers.  With  this  in  mind,  we  view 
the  RAM  as  a  model  which  when  presented  with  a  program  and  an  initial  input  produces 
computations;  this  is  made  more  precise  shortly.  The  RAM  comprises  a  state  register,  several 
computation  registers,  a  communication  register  and  an  address  space.  The  state  register 
together  with  the  computation  registers  form  a  RAM  state. 

The RAM performs actions which are described in terms of four basic operations: a load
operation, (Load, addr), a store operation, (Store, addr, data), a local operation, (Local),
and a terminate operation, (Terminate). The address space corresponds to the computer
memory.  The  load  operation  corresponds  to  a  fetch  from  memory  and  a  store  operation 
corresponds  to  a  write  to  memory.  A  local  operation  modifies  the  state  of  the  RAM  but 
does  not  affect  the  address  space  (i.e.  the  memory).  The  terminate  operation  is  a  special  local 
operation  which  indicates  that  the  RAM  has  terminated.  The  details  of  these  operations  as 
well  as  the  purpose  of  the  various  registers  is  made  precise  shortly. 

The computation of a RAM is a sequence of events defined by a program. A program
is a total mapping, $N : (Id, State, Comm1) \mapsto (Op, NewState)$. An event is any tuple,
$(Id, CurState, Comm1, Op, NewState, Comm2)$, satisfying
$N(Id, CurState, Comm1) = (Op, NewState)$, where $Id$ is an integer called the event
identifier, $CurState$ and $NewState$ are RAM states, $Comm1$ and $Comm2$ are values of the
communication register, and $Op$ is one of the four RAM operations named above. We note
that the operation $Op$ may comprise several fields; for example, the load operation requires
an address field. The values of the required fields are specified by the program. Constraints
on $Comm2$ will be specified subsequently. For any event, $e$, let $Id(e)$, $CurState(e)$,
$Comm1(e)$, $Op(e)$, $NewState(e)$, and $Comm2(e)$ be the corresponding members of the
event tuple.

A  RAM  computation  is  defined  to  comprise  an  input  and  a  sequence  of  events  satisfying 
two  constraints.  First,  the  value  returned  by  a  load  operation  is  the  most  recent  value  stored 
in  the  location  addressed,  if  any.  Otherwise,  the  value  returned  is  the  initial  value  assigned 
by the input to that location. More formally, a computation is a tuple comprising a partial
mapping, $I : addr \mapsto value$, called the input, and an infinite sequence of events, $e_1, e_2, \ldots$
satisfying the following for any $i \ge 1$:

(RAM) If $Op(e_i) = (Load, addr_i)$ then let $j$ be the largest number smaller than $i$, if any,
for which $Op(e_j) = (Store, addr_i, data)$ for some value $data$. If there is such a $j$ then
$Comm2(e_i) = data$, otherwise $Comm2(e_i) = I(addr_i)$.

If $Op(e_i)$ is not a load operation, $Comm2(e_i) = Comm1(e_i)$.

This  specifies  the  semantics  of  the  load  and  store  operations. 
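
The (RAM) condition can be checked mechanically on a finite prefix of a computation. The following Python sketch is illustrative only: the `Event` record, its field names and `check_ram_condition` are hypothetical, and a load from an address that the input leaves undefined is treated as an error, which is an assumption since the input is only a partial mapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    op: str                     # "load", "store", "local" or "terminate"
    addr: Optional[int] = None  # address field, used by load and store
    data: Optional[int] = None  # data field, used by store
    comm1: int = 0              # communication register before the event
    comm2: int = 0              # communication register after the event


def check_ram_condition(events, initial):
    """Return True if a finite event prefix satisfies the (RAM) condition."""
    memory = dict(initial)      # current contents, seeded from the input I
    for e in events:
        if e.op == "load":
            # A load returns the most recent store, else the input value.
            if e.comm2 != memory[e.addr]:
                return False
        else:
            # Any other operation leaves the communication register unchanged.
            if e.comm2 != e.comm1:
                return False
            if e.op == "store":
                memory[e.addr] = e.data
    return True


good = [Event("store", addr=0, data=7),
        Event("load", addr=0, comm1=0, comm2=7),
        Event("load", addr=1, comm1=7, comm2=5)]
bad = [Event("store", addr=0, data=7),
       Event("load", addr=0, comm1=0, comm2=3)]
print(check_ram_condition(good, {0: 1, 1: 5}))  # True
print(check_ram_condition(bad, {0: 1, 1: 5}))   # False
```

The checker keeps only the latest value per address, which is enough because the condition refers to the largest index $j$ of a matching store.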

Second, the terminate operation indicates that the RAM terminated and its semantics
is as follows. For any RAM program, $N$, any state, $s$, any communication register value, $c$,
and any identifier, $id$, if $N(id, s, c) = (terminate, s')$ then $s = s'$. In other words, for any
event, $e$, if $Op(e) = terminate$ then $NewState(e) = CurState(e)$. An event that executes a
terminate operation is called a terminate event; all other events are called effective events.

For any computation and any two events, $e$ and $e'$, in the computation, let $e' > e$ denote
that $e'$ is subsequent to $e$. A computation, $C$, is said to have terminated at $e$ if $e \in C$, $e$ is
an effective event, and for any event $e' \in C$ such that $e' > e$, $e'$ is a terminate event.

We  consider  RAM  models  which  restrict  the  size  of  the  state  block  and  the  size  of  the 
communication  register.  First,  we  assume  that  the  state  register  is  finite  and  that  the  state 
block has a fixed number of computation registers of $c \log n$ bits each, where $c$ is a constant
and $n$ is the size of the input. In addition, the communication register is assumed to have
$c \log n$ bits; this limits the number of bits that can be read from memory in a single operation.

In  algorithm  design  it  is  common  to  specify  the  program  using  a  high  level  language  such 
as  Pascal.  The  state  register  would  then  correspond  to  the  program  counter.  Reading  a 
variable  translates  to  a  load  operation  while  writing  a  variable  translates  to  a  store  operation. 
Arithmetic  and  logical  operations  translate  to  local  RAM  operations.  A  statement  such  as 

V  :=  A  +  B 


where V, A, and B are variables corresponds to four events executing (Load, A), (Load, B),
(Local) and (Store, V). The arithmetic operation, in this case the addition of A and B,
could  be  computed  as  part  of  the  Store  event,  but  it  is  usually  more  convenient  to  consider 
it  to  be  performed  by  a  separate  event  executing  a  local  operation. 

Programs  are  designed  to  solve  problems.  A  problem  definition  comprises  a  class  of 
inputs  and  a  desired  outcome;  the  outcome  can  be  thought  of  as  a  property.  For  instance, 
an  algorithm  that  sums  all  the  elements  in  the  input  can  be  defined  to  accept  as  its  input 
an  array  of  integers.  We  can  define  a  computation  of  the  algorithm  to  be  correct  if  the  last 
event  that  writes  to  address  0  writes  the  sum  of  all  the  input  elements. 

More formally, for any computation, $S$, and any property $F$, property $F$ holds for $S$ if
there is some event $e \in S$ such that for any event $e'$ subsequent to event $e$ property $F$ holds
after $e'$.

A RAM algorithm is defined to be a RAM program, $N$, together with a class of allowable
inputs, $\mathcal{I}$, a desired property, $F$, and a class of allowable computations, $\mathcal{C}$. For instance, one
natural restriction on the computations is that the initial events should have their current
state equal to some initial value (the specification of initial events is made precise later).
An algorithm is said to be correct if for any input, $I \in \mathcal{I}$, property $F$ holds for every
computation, $C \in \mathcal{C}$, of $N$ on input $I$.

In  the  following  sections  we  define  a  number  of  common  RAM  submodels  by  specifying 
their  corresponding  classes  of  allowable  computations. 

3.2      The  RAM  Family 

The  RAM  has  a  predefined  initial  state  called  InitState  and  a  predetermined  initial  value 
for the communication register called InitComm (e.g. 0). A sequential computation is a
computation whose events are chained. More formally, a computation is sequential if for
any $i \ge 1$:

(UNIQ) $Id(e_{i+1}) = Id(e_i)$.

(SRAM-I) $CurState(e_1) = InitState$, and $Comm1(e_1) = InitComm$.

(SRAM-C) $CurState(e_{i+1}) = NewState(e_i)$ and $Comm1(e_{i+1}) = Comm2(e_i)$.

Next, we define several parallel RAM models. For any computation, $C$, and for any
number, $k$, let $C_k$ be the subsequence of $C$ containing all the events with $Id = k$; so
$C_k = e_{k_1}, e_{k_2}, \ldots$. A computation, $C$, is called a parallel computation if for any $k$, $C_k$ satisfies
(SRAM-I) and (SRAM-C), where in SRAM-I $e_{k_1}$ replaces $e_1$, and in SRAM-C $e_{k_i}$ and $e_{k_{i+1}}$
replace $e_i$ and $e_{i+1}$, respectively. $C_k$ is called a process of $C$. It is convenient to assume that
the identifiers of the processes of a computation with $p$ processes are numbered from 0 to
$p - 1$. A process is said to have terminated at event $e$, if $e$ is the last effective event of the
process. Note that for any process, $P$, and any event $e \in P$, if $Op(e) = terminate$ then for
any $e' \in P$ such that $e' > e$, $Op(e') = terminate$.
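
The definition of a parallel computation amounts to splitting the event sequence by identifier and checking the chaining conditions inside each subsequence. The sketch below illustrates this on a finite prefix; the tuple layout, the constants `INIT_STATE` and `INIT_COMM`, and the function name `is_parallel` are hypothetical choices made for the example, and the operation field is omitted because the two conditions do not mention it.

```python
from collections import defaultdict

# Assumed stand-ins for InitState and InitComm.
INIT_STATE, INIT_COMM = "init", 0


def is_parallel(events):
    """events: list of (ident, cur_state, comm1, new_state, comm2) tuples."""
    processes = defaultdict(list)
    for ev in events:
        processes[ev[0]].append(ev)      # C_k: events with Id = k, in order
    for chain in processes.values():
        _, cur, comm1, _, _ = chain[0]
        if (cur, comm1) != (INIT_STATE, INIT_COMM):   # SRAM-I on e_{k_1}
            return False
        for prev, nxt in zip(chain, chain[1:]):       # SRAM-C within C_k
            if nxt[1] != prev[3] or nxt[2] != prev[4]:
                return False
    return True


interleaved = [(0, "init", 0, "a", 0),
               (1, "init", 0, "b", 4),
               (0, "a", 0, "c", 9),
               (1, "b", 4, "d", 4)]
broken = [(0, "init", 0, "a", 0),
          (0, "x", 0, "c", 0)]
print(is_parallel(interleaved))  # True
print(is_parallel(broken))       # False
```

Note that the interleaving of the two processes is irrelevant to the check; only the order of events within each process matters.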

A synchronous computation is a parallel computation which can be divided into contiguous
segments called time steps, satisfying:

1.  Each  segment  contains  exactly  one  event  from  each  process. 

2.  If  there  are  two  events  in  a  segment,  one  executing  a  load  (read)  operation  and  the 
other  a  store  (write),  both  using  the  same  address,  then  the  event  performing  the  load 
operation  precedes  the  event  performing  the  store  operation. 
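
The two conditions above can be tested directly on a finite prefix. Because each time step holds exactly one event per process, a prefix with $p$ processes must split into consecutive blocks of $p$ events. The sketch below assumes events are given as hypothetical `(pid, op, addr)` triples with process identifiers 0 to $p - 1$; `is_synchronous` is an illustrative name, not part of the model.

```python
def is_synchronous(events, p):
    """events: list of (pid, op, addr) triples; p: number of processes."""
    if len(events) % p != 0:
        return False
    for start in range(0, len(events), p):
        step = events[start:start + p]
        # Condition 1: exactly one event from each process in the time step.
        if {pid for pid, _, _ in step} != set(range(p)):
            return False
        # Condition 2: a load of an address precedes any store to it.
        stored = set()
        for _, op, addr in step:
            if op == "store":
                stored.add(addr)
            elif op == "load" and addr in stored:
                return False
    return True


ok = [(0, "load", 5), (1, "store", 5), (0, "local", None), (1, "local", None)]
store_first = [(0, "store", 5), (1, "load", 5)]
one_process_twice = [(0, "load", 1), (0, "load", 2)]
print(is_synchronous(ok, 2))                 # True
print(is_synchronous(store_first, 2))        # False
print(is_synchronous(one_process_twice, 2))  # False
```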

The  sequential  RAM  is  equivalent  to  a  RAM  restricted  to  sequential  computations.  For 
parallel  computations,  an  algorithm  description  specifies  the  number  of  processes  used;  this 
can  be  a  function  of  the  input.  If  an  algorithm  is  defined  to  use  p  processes  this  is  equivalent 
to  the  restriction  that  a  computation  is  valid  only  if  it  has  at  most  p  processes.  The  PRAM 
is  equivalent  to  a  RAM  restricted  to  synchronous  computations.  We  define  the  APRAM  to 
be  a  RAM  restricted  to  parallel  computations.  We  note  that  the  PRAM  is  a  restriction  of 
the  APRAM. 

3.3      Parallel  Submodels 

The  PRAM  model  is  subdivided  into  submodels  which  differ  in  their  restrictions  on  access 
to  shared  memory.  Corresponding  subdivisions  can  be  defined  in  the  APRAM  model. 

Consider any computation, $C$. The executed before relation defined below captures the
notion that the execution of an event had an effect, direct or indirect, on the execution of
another event. For any event, $s$, let $P(s)$ denote the process to which $s$ belongs. For any
two events, $e_1$ and $e_2$, $e_1 \to e_2$, if one of the following holds:

T1. $P(e_1) = P(e_2)$ and $e_1$ occurs before $e_2$ in $C$.

T2. Event $e_2$ executes a load operation with address $a$, and $e_1$ is the most recent event in
$C$ which executes a store with address $a$.

T3. There is an event, $e_3$, for which $e_1 \to e_3$ and $e_3 \to e_2$.

For synchronous computations, in addition to the above three conditions, $e_1 \to e_2$ if

T4. $e_2$ is an event in the time step immediately following the time step containing $e_1$.

T5. $e_1$ is executing a Load operation and $e_2$ is executing a Store operation to the same
address and $e_1$ and $e_2$ are in the same time step.

Conditions T4 and T5 correspond to the synchronization implicit in the PRAM model. Two
distinct events, $e_1$ and $e_2$, are called concurrent if $e_1 \not\to e_2$ and $e_2 \not\to e_1$.
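
For an asynchronous computation the executed before relation is the transitive closure of two kinds of direct edges: consecutive events of one process (T1) and the store that a load reads from (T2). The sketch below computes it for a finite prefix under those three conditions only (T4 and T5 are not modelled); the `(pid, op, addr)` event layout and both function names are hypothetical.

```python
def executed_before(events):
    """Return the set of index pairs (i, j) with e_i -> e_j under T1-T3."""
    n = len(events)
    before = [set() for _ in range(n)]   # before[j]: all i with e_i -> e_j
    last_of_process = {}                 # pid -> index of its latest event
    last_store = {}                      # addr -> index of the latest store
    for j, (pid, op, addr) in enumerate(events):
        direct = []
        if pid in last_of_process:       # T1, via the previous event of pid
            direct.append(last_of_process[pid])
        if op == "load" and addr in last_store:   # T2
            direct.append(last_store[addr])
        for i in direct:                 # T3: inherit everything before e_i
            before[j] |= before[i] | {i}
        last_of_process[pid] = j
        if op == "store":
            last_store[addr] = j
    return {(i, j) for j in range(n) for i in before[j]}


def concurrent(rel, i, j):
    """Two distinct events are concurrent if neither precedes the other."""
    return i != j and (i, j) not in rel and (j, i) not in rel


events = [(0, "store", "x"),   # e_0
          (1, "local", None),  # e_1
          (1, "load", "x"),    # e_2 reads the value stored by e_0
          (0, "local", None)]  # e_3
rel = executed_before(events)
print(sorted(rel))             # [(0, 2), (0, 3), (1, 2)]
print(concurrent(rel, 2, 3))   # True
```

Since every direct edge points forward in the serialization, one left-to-right pass suffices to build the closure.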

We  say  that  an  event  accesses  a  variable,  x,  if  it  either  writes  to  x  or  reads  from  x.  A 
computation  is  said  to  be  exclusive  write  (resp.  exclusive  read)  if  for  any  variable,  x,  the 
events writing to (resp. accessing) $x$ are totally ordered with respect to the relation $\to$. A
program  is  called  exclusive  write  (resp.  exclusive  read)  if  every  computation  of  the  program 
is  exclusive  write  (resp.  exclusive  read).  A  program  is  called  synchronously  exclusive  write 
(resp.  synchronously  exclusive  read)  if  every  synchronous  computation  is  exclusive  write 
(resp.  exclusive  read).  The  CREW  PRAM  (resp.  the  EREW  PRAM)  is  a  synchronously 
exclusive  write  (resp.  synchronously  exclusive  read)  RAM. 

When  using  the  CRCW  PRAM  model  one  must  define  how  write  conflicts  are  resolved. 
Several  submodels  appear  in  the  literature;  the  more  popular  ones  are  the  Priority,  Arbitrary 
and  Common.  In  the  Priority  model,  each  process  is  assigned  a  priority.  When  several 
processes  attempt  to  write  to  the  same  memory  location  at  the  same  time  the  process 
with  highest  priority  succeeds.  In  the  Arbitrary  model,  when  several  processes  attempt  to 
write  into  the  same  memory  location  at  the  same  time  one  of  them  succeeds  but  it  is  not 
known  which;  this  introduces  non-determinism  to  the  computation.  In  the  Common  model, 
processes  are  allowed  to  write  simultaneously  to  the  same  location  only  if  they  all  write  the 
same  value. 

Define the synchronous APRAM to be an APRAM restricted to synchronous computations.
The same variations can be carried over to the synchronous APRAM model. For the
synchronous APRAM priority model, each process is assigned a priority. A synchronous
APRAM computation is called Prioritized if for any two events, $e_1$ and $e_2$, which write to
the same location in the same time step, if the priority of the process executing $e_1$ is higher
than that of the process executing $e_2$, then $e_1$ appears after $e_2$ in the computation. For the
common model, we require that for any two events, $e_1$ and $e_2$, if they both write to the same
location in the same time step, they write the same value.
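
The Prioritized restriction works because a computation is a serialization: the store that appears last in a time step determines the value left in memory. The sketch below illustrates this for the write events of a single time step. It assumes, purely for the example, that a larger number means a higher priority, and the names `order_prioritized`, `apply_writes` and `is_common` are hypothetical.

```python
def order_prioritized(writes, priority):
    """writes: list of (pid, addr, value) from one time step.
    Returns them ordered so that higher-priority writers appear later."""
    return sorted(writes, key=lambda w: priority[w[0]])


def apply_writes(memory, writes):
    """Serialize the stores; the last store to an address fixes its value."""
    for _, addr, value in writes:
        memory[addr] = value
    return memory


def is_common(writes):
    """Common model: writes to one address in a time step must agree."""
    seen = {}
    for _, addr, value in writes:
        if seen.setdefault(addr, value) != value:
            return False
    return True


writes = [(2, 0, "c"), (0, 0, "a"), (1, 0, "b")]
priority = {0: 3, 1: 1, 2: 2}          # process 0 has the highest priority
ordered = order_prioritized(writes, priority)
print(apply_writes({}, ordered))       # {0: 'a'}
print(is_common(writes))               # False
print(is_common([(0, 0, 1), (1, 0, 1), (2, 3, 9)]))  # True
```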

Remark.  We  are  not  showing  how  to  implement  the  various  PRAM  models  in  the  APRAM 
model.  Rather,  we  are  showing  that  appropriate  restrictions  on  the  APRAM  model  yield 
the  various  PRAM  models. 


4     The  Rounds  Complexity  Measure 

The  PRAM  is  a  synchronous  model.  In  using  the  PRAM  as  a  programming  model,  one  in 
effect  assumes  the  existence  of  a  global  clock  to  which  all  the  processes  synchronize.  It  is 
natural  to  use  this  clock  as  a  measure  of  algorithmic  complexity.  The  APRAM  does  not 
assume  the  existence  of  a  global  clock;  therefore,  we  need  complexity  measures  that  can 
replace  the  running  time  complexity  of  synchronous  models.  These  new  measures  should 
reflect  the  elapsed  real  time  from  the  start  of  the  algorithm  until  its  termination  if  it  were 
implemented  in  an  asynchronous  environment. 

For any PRAM algorithm, $A$, let $T(n)$ and $P(n)$ denote the parallel running time and
the number of processes used by the algorithm on inputs of size $n$. The number of effective
events in any synchronous computation of $A$ on inputs of size $n$ is at most $P(n) \cdot T(n)$. In this
respect, the length of a synchronous computation gives some indication of the algorithm's
complexity. This need not be the case for asynchronous algorithms. For example, if one
process, $P_1$, is assigned to set a global variable, $x$, (which is initially reset), and all other
processes wait for the variable to be set (e.g. by repeatedly checking the value of $x$) one can
create arbitrarily long computations by delaying the step of $P_1$ that sets $x$. One can view
the  frequency  of  events  from  a  process  P  in  a  segment  of  a  computation  as  the  process's 
speed.  This  example  shows  that  it  is  possible  for  the  length  of  a  computation  to  increase 
when  the  speed  of  a  subset  of  the  processes  is  increased. 
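
The busy-waiting example can be made concrete by counting events. The sketch below builds one possible computation in which every waiting process polls `x` a chosen number of times before the setting process finally performs its store; the function name, the event layout and the numbering of the waiting processes from 2 upward are assumptions made for the illustration.

```python
def spin_wait_length(num_waiters, polls_before_set):
    """Number of events in a computation where each waiter polls x
    `polls_before_set` times before process 1 stores to x."""
    events = []
    for _ in range(polls_before_set):
        for w in range(2, 2 + num_waiters):
            events.append((w, "load", "x"))   # waiter still sees x reset
    events.append((1, "store", "x"))          # process 1 sets x
    for w in range(2, 2 + num_waiters):
        events.append((w, "load", "x"))       # each waiter now sees x set
    return len(events)


for polls in (0, 10, 1000):
    print(polls, spin_wait_length(3, polls))
# 0 4
# 10 34
# 1000 3004
```

The useful work is the same in all three runs; only the relative speed of the waiting processes changes, which is why the raw length of a computation is a poor complexity measure here.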

4.1      The  Virtual  Clock 

The running time, or parallel time, complexity used in analyzing PRAM algorithms corresponds
to the number of time steps in a synchronous computation. One approach in
asynchronous models is to use a virtual clock (or logical clock [Lam78]). This approach was
introduced in [PF77] and used in [AFL83,LF81] and is common in the area of distributed
computing (see [Awe87,Awe85,AG87]). Consider a computation, C. A virtual clock of
C is an assignment of unique virtual times to the events of C; the times assigned are a
non-decreasing function of the event number. The time difference between two consecutive
events  of  a  process  is  called  the  duration  of  the  later  event.  The  complexity  of  a  computation 
is  the  time  assigned  to  the  last  effective  event  in  the  computation. 

One  can  obtain  variations  in  the  complexity  measure  by  restricting  the  allowable  virtual 
clocks.  One  such  variation,  called  the  rounds  complexity,  requires  the  duration  of  each  event 
to  be  at  most  1.  In  effect,  it  assumes  that  each  operation  takes  at  most  one  unit  of  time,  but 
is allowed to take less. Other variations for measuring the implicit costs of synchronization
are presented in [CZ90a].

Let $C$ be any computation and $T$ any virtual clock of $C$. For each event, $e$, the time
of $e$, $t(e)$, is the time assigned to $e$ by $T$. We call the integer part of the time of $e$ its
round number, $rd(e)$: $rd(e) = \lfloor t(e) \rfloor$. The virtual clock in effect divides the computation
into  contiguous  segments,  called  rounds,  so  that  each  segment  contains  at  least  one  event 
from  each  process.  A  round  is  called  effective  if  it  has  at  least  one  effective  event. 

The  rounds  complexity  of  a  computation  is  the  number  of  effective  rounds  in  a  virtual 
clock  of  the  computation  maximized  over  all  possible  virtual  clocks.  The  rounds  complexity 
of  an  algorithm  for  a  given  input,  /,  is  the  maximum  rounds  complexity  over  all  possible 
computations  on  input  /.  The  rounds  complexity  of  an  algorithm  for  inputs  of  size  n  is  then 
defined  to  be  the  maximum  rounds  complexity  over  all  inputs  of  size  n.  Note  that  each  time 
step  of  a  synchronous  computation  is  a  round  which  contains  exactly  one  event  from  each 
process. 
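The definitions of round number and effective round can be made concrete with a small sketch. The function names and the representation of a virtual clock as a list of floating point times are assumptions made for illustration only; in particular, the sketch assumes that the duration of a process's first event is measured from time 0.

```python
import math

def round_number(t):
    """Round number rd(e) of an event with virtual time t: the integer part of t."""
    return math.floor(t)

def satisfies_rounds_restriction(times):
    """True if every event of one process has duration at most 1.

    `times` lists the virtual times of the process's events in order.
    Assumption: the first event's duration is measured from time 0.
    """
    previous = 0.0
    for t in times:
        if t - previous > 1:
            return False
        previous = t
    return True

def count_effective_rounds(events):
    """Count the rounds that contain at least one effective event.

    `events` is a list of (virtual_time, is_effective) pairs.
    """
    return len({round_number(t) for t, effective in events if effective})
```

The rounds complexity of a computation would then be the maximum of `count_effective_rounds` over all virtual clocks that pass `satisfies_rounds_restriction` for every process; the sketch evaluates a single clock and does not perform that maximization.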

We  wish  to  stress  that  our  model  refers  to  processes  and  not  processors.  Brent's  theorem 
[Bre74]  justifies  the  use  of  processes,  when  designing  PRAM  algorithms,  without  regard  to 
the  actual  number  of  physical  processors  at  hand.  An  analogue  of  Brent's  theorem  applies 
to  the  APRAM  model: 

Lemma 4.1 A t-round, p-process algorithm can be simulated using q < p processes in
O(tp/q) rounds.

Proof. The processes of the p-process algorithm, $A_p$, are distributed evenly among the q
processes of the simulating algorithm, $A_q$. Each process of $A_q$ receives either $\lceil p/q \rceil$ or $\lfloor p/q \rfloor$
processes of $A_p$. Each process of $A_q$ simulates its assigned processes in round robin fashion.
Consider a computation, C, of $A_q$. There is a one to one mapping from the events of C to a
computation of $A_p$. Partition C into phases of $\lceil p/q \rceil$ rounds. Each phase contains at least
one event of every process of $A_p$, so each phase is a round of $A_p$. Hence, $A_q$ uses at most
$t \lceil p/q \rceil$ rounds. □
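The even distribution and the round robin schedule used in the proof can be sketched as follows. The helper names are hypothetical, and assigning simulated process j to simulator j mod q is only one of several even distributions that would serve.

```python
def assign_round_robin(p, q):
    """Distribute simulated processes 0..p-1 evenly among q simulating processes."""
    groups = [[] for _ in range(q)]
    for j in range(p):
        groups[j % q].append(j)
    return groups

def stepped_in_phase(group, start, length):
    """Simulated processes advanced by one simulator during `length`
    consecutive round robin steps, beginning at position `start`."""
    return {group[(start + k) % len(group)] for k in range(length)}

def simulated_round_bound(t, p, q):
    """Upper bound t * ceil(p/q) on the rounds used by the simulation."""
    return t * -(-p // q)  # -(-p // q) is ceil(p / q) in integer arithmetic
```

Because no group holds more than ceil(p/q) simulated processes, any phase of ceil(p/q) steps advances every member of a group at least once, wherever in the cycle the phase begins.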

So  far  we  have  defined  the  model  and  the  complexity  measures  in  terms  of  events. 
Requiring  this  low  level  detail  may  make  the  design  process  tedious  and  the  analysis  unduly 
complicated.  Instead,  we  group  statements  into  larger  constructs  called  superevents.  A 
process,  p,  is  said  to  execute  a  superevent,  E,  if  the  events  comprising  E  appear  as  a 
segment  of  p.  We  allow  each  superevent  to  contain  up  to  r  events,  for  some  fixed  number, 
r,  independent  of  the  input.  We  then  define  a  superround  to  be  a  segment  in  which  each 
process  executes  at  least  one  superevent. 

Every segment comprising 2r - 1 rounds contains at least one complete superevent of each
process. Therefore, an algorithm which has complexity O(R) superrounds has complexity O(rR) rounds, which
for r a constant is O(R) rounds. We ignore constant factors, as is often done in asymptotic
analysis,  and  henceforth  use  superevents  and  superrounds.  When  no  ambiguity  arises,  we 
use  the  term  round  to  refer  to  a  superround. 

The complexity of an APRAM algorithm is measured in terms of the pair [round complexity,
number of processes]. This is intended to correspond to the measure [time, processes]
used for the PRAM model. The aim is that this measure of complexity should correspond
roughly to a [time, processors] complexity measure on more realistic machines. The notion
of rounds is far from new; it is used extensively when analyzing distributed (asynchronous)
algorithms; however, in distributed algorithms, typically, the other component of the complexity
is the number of messages transmitted. We are interested in a more tightly coupled
form of processing, which is characteristic of parallel computation. We note that the virtual
clock is used solely for the round analysis; the correctness arguments must prove the
algorithm correct regardless of the division into segments.

We  illustrate  the  round  complexity  by  means  of  two  simple  examples:  computing  the 
sum  of  n  integers  given  in  an  array  and  a  recursive  doubling  algorithm. 

4.2      Summation  Algorithm 

The APRAM summation algorithm given below is very similar to the known PRAM algorithm.
It uses 2n - 1 memory cells arranged in an implicit complete binary tree with the
input elements at the leaves. A process is associated with each internal node of the tree.
Each memory cell has a valid bit which is initialized to true for all the input elements (the
leaves) and false for all internal nodes. For each process, i, $V_i$ is the internal node associated
with i, and $L_i$ and $R_i$ are the left child and right child of $V_i$, respectively.

The algorithm iterates the following superevent comprising a condition test plus a possible
execution of the if statement.

Algorithm for process i:

    while (V_i is not valid) do
        if R_i and L_i are valid then
            set V_i := L_i + R_i
            set the tag of V_i to valid
        end if
    end while

Once both children of a node are valid the process associated with the node executes
only one more iteration, during which its associated node is validated. As the depth of the
tree is $\lceil \log_2 n \rceil$, and each process executes at least one iteration at each round, the number
of rounds required is at most $\lceil \log_2 n \rceil$. We have shown:
Theorem 4.1 There is an APRAM algorithm which computes the sum of n elements given
in an array in O(log n) rounds using n - 1 processes.
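The round bound can be checked experimentally with a small sequential simulation. This is a sketch under stated assumptions, not the paper's implementation: the tree is stored in heap order (an assumed layout, with internal nodes 1..n-1 and leaves n..2n-1), n is at least 2, and asynchrony is modelled by a random order of processes within each round, with each process executing between one and three loop iterations per round.

```python
import math
import random

def depth_bound(n):
    """Depth ceil(log2 n) of the implicit tree with n leaves."""
    return math.ceil(math.log2(n))

def async_sum(values, rng=None):
    """Simulate the summation algorithm; return (sum, rounds used)."""
    rng = rng or random.Random()
    n = len(values)                      # assumed n >= 2
    cell = [0] * (2 * n)                 # index 0 unused
    valid = [False] * (2 * n)
    for k, v in enumerate(values):       # leaves start out valid
        cell[n + k] = v
        valid[n + k] = True

    def superevent(i):
        """One loop iteration of process i: the test plus the possible update."""
        left, right = 2 * i, 2 * i + 1
        if not valid[i] and valid[left] and valid[right]:
            cell[i] = cell[left] + cell[right]   # write the value first,
            valid[i] = True                      # then publish it as valid

    rounds = 0
    while not all(valid[1:n]):
        rounds += 1
        order = list(range(1, n))
        rng.shuffle(order)               # arbitrary interleaving within a round
        for i in order:
            for _ in range(rng.randint(1, 3)):   # at least one iteration per round
                superevent(i)
    return cell[1], rounds
```

For example, `async_sum(list(range(1, 14)), random.Random(0))` returns the sum 91 together with a round count that never exceeds `depth_bound(13)`, whatever interleaving the random scheduler picks.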

The above algorithm can easily be modified to use only n/log n processes while achieving
the  same  asymptotic  bound.  It  is  also  straightforward  to  extend  this  result  to  obtain  a 
parallel  prefix  algorithm  with  the  same  complexity. 

4.3      Recursive  Doubling 

The  recursive  doubling  algorithm  takes  as  input  n  elements  given  in  an  array.  Each  element, 
u, has a pointer, p(u), which points to its successor, an element in the array. An element
is said to be the head of a list if p(u) = u; it is assumed that p does not have non-trivial
cycles.  The  recursive  doubling  algorithm  is  the  heart  of  most,  if  not  all,  of  the  known 
parallel  list  ranking  algorithms  [Wyl79,AM88,CV86,CV88,MS90].  We  do  not  present  a  list 
ranking  algorithm  here,  for  it  involves  questions  regarding  atomicity  which  we  do  not  intend 
to  discuss  presently. 

The APRAM algorithm for recursive doubling requires no synchronization and is identical
to the simple PRAM algorithm due to Wyllie [Wyl79].

Algorithm for process i:

1  while (p(p(i)) ≠ p(i)) do
2      p(i) := p(p(i))

Statement 1 checks for termination and Statement 2 updates the value of p(i). The correctness
of the algorithm follows from the invariant that for any element, u, p(u) is an element
that was an ancestor of u in the input list.

For  any  two  elements,  u  and  v,  define  the  distance  between  u  and  v  to  be  the  number  of 
edges  that  must  be  traversed  in  the  input  list  to  get  from  u  to  v.  Let  C  be  any  computation 
and  define  a  superevent  to  comprise  a  complete  iteration  of  the  while  loop.  For  any  two 
elements, u and v, of the list and any number, i, let $p_i(u)$ be the successor of u after round
i and let D(u,v) be the distance from u to v. The recursive doubling algorithm maintains
the  invariant  that  for  any  number  i  and  for  any  element  u,  if  after  round  i  the  successor  of 
u  is  not  a  head  of  a  list,  then 

$D(u, p_{i+1}(u)) \ge D(u, p_i(u)) + D(p_i(u), p_i(p_i(u)))$.

Since $D(u, p_0(u)) = 1$, after round i every element which does not point to the end of its
list points to an element at distance at least $2^i$. We conclude:
Theorem 4.2 Given n elements in an array, forming a collection of linked lists, the recursive
doubling algorithm computes for each element the end of its list using n processes in
O(log n) rounds.
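The behaviour of the unsynchronized loop can be illustrated with a sequential simulation. The sketch makes assumptions that are not part of the paper: elements are array indices, `succ[u]` plays the role of p(u), each round executes exactly one loop iteration per process, and the order within a round is random.

```python
import random

def async_doubling(succ, rng=None):
    """Simulate recursive doubling; return (final pointers, rounds used).

    `succ[u]` is the successor of element u; a head satisfies succ[u] == u.
    """
    rng = rng or random.Random()
    p = list(succ)
    n = len(p)
    rounds = 0
    # Some process is still running while its successor is not yet a head.
    while any(p[p[i]] != p[i] for i in range(n)):
        rounds += 1
        order = list(range(n))
        rng.shuffle(order)               # arbitrary interleaving within a round
        for i in order:
            if p[p[i]] != p[i]:          # Statement 1: termination test
                p[i] = p[p[i]]           # Statement 2: pointer jump
    return p, rounds
```

A process that runs late in a round may read a pointer that was already advanced in the same round. This only moves its own pointer further along the list, which is why the distance invariant is an inequality and why no synchronization is needed.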

The two algorithms analyzed above are simple and straightforward. A companion paper,
[CZ90b],  presents  an  n  +  e  process,  O(logn)  rounds  APRAM  algorithm  for  computing  the 
connected  components  of  an  undirected  graph  with  n  vertices  and  e  edges;  this  algorithm  is 
substantially  different  from  the  known  PRAM  algorithms,  and  its  analysis  is  quite  intricate. 
We  remind  the  reader  that  the  fastest  PRAM  algorithms  for  graph  connectivity  run  in 
O(log(n  +  m))  time  on  a  CRCW  PRAM.  In  Shiloach  and  Vishkin  [SV82],  n  +  m  processes 
are used; in Cole and Vishkin [CV87] this is reduced to $(n + m)\alpha(m, n)/\log(n + m)$ processes,
but  at  the  cost  of  large  constants  resulting  from  the  presence  of  expander  graphs. 

5  APRAM  Simulations 

The tree computation used in the summation algorithm, as well as the list based recursive
doubling algorithm, can be used to execute a barrier-like synchronization among the
processes in O(log n) rounds. From this it follows that:

Theorem 5.1 Any PRAM algorithm, A, with parallel running time T(n) using P(n) processes
can be simulated by an APRAM algorithm which uses P(n) processes and has round
complexity O(T(n) log n).

By performing a simulation on P' < P(n)/log P(n) processes, and performing the barrier-like
synchronization every log(P(n)/P') rounds, one can obtain:

Theorem 5.2 (Optimal Simulation) Any PRAM algorithm, A, with parallel running
time T(n) using P(n) processes can be simulated by an APRAM algorithm which uses P'
processes, P' < P(n)/log P(n), and has round complexity O(T(n)P(n)/P').
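One way to see why a bound of this form is plausible is the following back-of-the-envelope accounting. It is a simplified sketch rather than the paper's argument: it assumes one barrier per simulated PRAM step, that a barrier among P' processes costs O(log P') rounds, and that each simulating process handles about P(n)/P' PRAM processes per step, as in the analogue of Brent's theorem.

```latex
\begin{align*}
\text{rounds per PRAM step} &= O\!\left(\frac{P(n)}{P'} + \log P'\right), \\
\log P' \le \log P(n) &< \frac{P(n)}{P'}
    \quad \text{since } P' < \frac{P(n)}{\log P(n)}, \\
\text{total rounds} &= O\!\left(T(n)\,\frac{P(n)}{P'}\right).
\end{align*}
```

Under these assumptions the cost of the barrier is absorbed by the cost of simulating a step, so the logarithmic factor of Theorem 5.1 disappears.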

For any PRAM algorithm, A, the APRAM algorithm obtained by inserting a barrier-like
synchronization after each statement is called the APRAM simulation of A. The multiplicative
factor of O(log n), introduced by the simulation, can often be avoided, as we have
already seen for the summation and recursive doubling problems.

6  Conclusions 

This  paper  gives  a  formal  definition  of  the  APRAM  model  and  its  associated  complexity 
measure,  the  rounds  complexity.  The  APRAM  model  is  an  asynchronous  generalization  of 
the RAM and PRAM models. It is introduced in order to study the synchronization costs
of  shared  memory  parallel  computation,  costs  which  are  hidden  by  the  synchronous  PRAM 
model. 

The  APRAM  does  not  provide  a  global  clock  by  which  all  the  processes  can  synchronize. 
Instead,  the  required  synchronization  must  be  coded  into  the  algorithm.  We  have  shown 
that  careful  design  may  lead  to  efficient  asynchronous  algorithms;  one  method  demonstrated 
in this paper replaces global synchronization by local synchronization.

When a PRAM algorithm is run on an asynchronous machine, it is necessary to synchronize
all the processes after each step. We have shown that this synchronization may
increase the rounds complexity of the algorithm by a multiplicative factor of O(log n), where
n is the size of the input. Using an asynchronous model, such as the APRAM, favors algorithms
with fewer synchronization requirements, yielding algorithms which can be run faster
in asynchronous environments.